We describe a new generation of algorithms capable of mapping the structure
and conformations of macromolecules and their complexes from large
ensembles of heterogeneous snapshots, and demonstrate the feasibility of 
determining both discrete and continuous macromolecular conformational 
spectra. These algorithms naturally incorporate conformational heterogeneity 
without resort to sorting and classification, or prior knowledge of the type of 
heterogeneity present. They are applicable to single-particle diffraction and 
image datasets produced by X-ray lasers and cryo-electron microscopy, respect- 
ively, and particularly suitable for systems not easily amenable to purification 
or crystallization. 
1. Introduction

Biomolecular interactions, through the formation of transient or robust complexes, 
are at the centre of cellular function and life itself. There is increasing recognition 
that biological macromolecular complexes exist in a range of conformations, and 
that these can play a vital role in their function. The virulence of the dengue 
virus, for example, strongly depends on transitions in its protein contacts and con- 
formational rearrangements [1]. A deep understanding of the nature and role of 
conformational variety in biological function would revolutionize our knowledge 
of key processes ranging from basic cell function to pathological states. 

Despite powerful contributions to the study of proteins and some complexes, 
X-ray crystallography and NMR have limitations. With notable exceptions, the 
constraints imposed by crystals have limited the role of X-ray crystallography 
in elucidating conformational variety. NMR, while able to study conformations 
in biomolecules of modest size, has not been extensively applied to larger systems. 
Cryo-electron microscopy (cryo-EM), fortified with increasingly sophisticated 
algorithmic approaches [2-4], has been extensively used to study macromolecu- 
lar complexes. However, conformational variety presents a challenge to cryo-EM 
methods, which often assume its absence. When conformational variety has been 
explicitly addressed, the results, won with effort and ingenuity, have provided 
tantalizing evidence of a rich variety of conformations, even in well-studied sys- 
tems [4-6]. The difficulties faced in addressing macromolecular complexes and 
interactions by otherwise successful structural means have led to the recognition 
that dealing with heterogeneity represents an important challenge in need of 
urgent attention. For example, there is currently no direct means for mapping 
the continuum of three-dimensional structures assumed by conformationally 
flexible complexes, such as the therapeutically important G protein-coupled 
receptors [6]. 

Heterogeneity is pervasive, because the observation of an ensemble of macro- 
molecules in reaction or interaction naturally leads to a collection of snapshots 
from non-identical objects. The ability to extract structural information from 
large datasets of snapshots obtained from configurationally and conformatio- 
nally heterogeneous ensembles of complexes would substantially advance our 
understanding of macromolecular conformations and their
role in biology. However, structure recovery methods such as 
cryo-EM [7,8] and emerging X-ray free-electron laser (XFEL) 
'diffract-then-destroy' approaches [9-13] are generally predi- 
cated on viewing a series of identical objects from different 
angles. The obvious method of 'sorting' the data into classes, 
each stemming from nominally identical objects, is fraught
with difficulty: the number and types of classes are often
unknown; sorting must be performed at very low signal-to-noise
ratios (SNR ≤ 0.1); and residual
heterogeneities persist, even when the classes are small.
Indeed, there is evidence that the resolution achieved by 
cryo-EM is often limited by sample heterogeneity. 

Heterogeneity can be tackled by sorting with reference to 
templates, which often can only be guessed at. The dangers in 
this approach are well known [8]. For example, the image of 
any individual, say Einstein or Newton, can emerge from 
random noise, depending on whose portrait was used as a 
template. Approaches based on Bayesian inference and maxi- 
mum-likelihood are powerful [4], but inherently favour 
the discovery of discrete conformations. Their computatio- 
nal expense and scaling behaviour also limit their practical 
application to a small number of conformations. In short, 
the analysis of existing cryo-EM datasets and those emerg- 
ing from XFEL techniques is severely hampered by the 
absence of algorithms able to deal naturally and efficiently 
with heterogeneity. 

Methods recently developed in our group offer 
the possibility to extract structural and conformational 
information from heterogeneous datasets directly and effi- 
ciently [14-18]. These methods combine techniques from 
Riemannian geometry, graph-theoretical dimensionality 
reduction ('manifold embedding') and scattering physics. 
Fundamentally, however, they are based on the simple recog- 
nition that each snapshot in a heterogeneous dataset provides 
information about all states of the system under observa- 
tion. For example, the view from the back of a person's head 
has valuable information about the full-frontal view, because 
it reveals where the ears are — irrespective of whether the 
person is smiling or not. Thus, the entire dataset can be used 
to reconstruct each state of the system, even when the dataset 
is heterogeneous. This approach substantially increases the 
available information, allowing one to operate at significantly 
lower signal levels than needed today. And it offers the possi- 
bility to use the information from all conformations to recover 
the three-dimensional structure of each. 

Here, we describe our approach and present results 
demonstrating its capabilities in the context of simulated 
XFEL diffraction and cryo-EM image snapshots. Section 2 
outlines the approach in conceptual terms. Mathematical 
underpinning can be found in [17,18] and the references 
therein. Section 3 presents results on simulated diffraction 
snapshots of an enzyme undergoing large conformational 
changes. More subtle changes are likely to remain beyond 
the reach of XFEL experiments for some time. Section 4 
describes results on simulated ultra-low signal cryo-EM snap- 
shots of objects undergoing more subtle conformational 
changes, imaged in the presence of large systematic changes 
such as defocus variation. These results pertain to structures 
often used to benchmark different approaches, in order to 
facilitate comparative assessment. We discuss our results 
and future challenges in §5, and summarize and conclude 
the paper in §6. 



2. Conceptual outline of approach 

Our approach is able to recover three-dimensional structural 
and conformational information from ultra-low signal, hetero- 
geneous datasets without templates or pre-classification. This 
stems from the recognition that (i) datasets from ensembles 
of macromolecules in reaction or interaction are necessarily 
heterogeneous and (ii) the information content of the entire 
dataset can be used to recover the three-dimensional structure 
corresponding to each conformation of the system. Rather than 
avoiding heterogeneity by careful experimental means or 
through sorting the snapshots into nominally identical classes, 
the approach exploits heterogeneity to increase the available 
information substantially. 

A more technical, but equally important aspect rests on the 
recent discovery that snapshots produced by scattering exper- 
iments reside on data manifolds with specific symmetries [17]. 
These symmetries stem from the nature of operations in space, 
rather than from the object itself, and are thus entirely general. 
This allows one to project complex, noisy datasets on known 
manifolds, much as one fits data to a polynomial of known 
type. Because the fit is determined by the entire dataset, this 
represents an efficient and noise-robust means of extracting 
information. In principle, multiple species result in multiple 
manifolds, and the properties of each manifold ('coefficients 
of each fit') can be used to deduce the three-dimensional struc- 
ture and conformational continuum of each species. Slightly 
more technically, modern graph-theoretic manifold embed- 
ding techniques [16,19-27] can be used to find manifolds 
produced by scattering. Similar manifolds have been pre- 
viously encountered in certain general relativistic models of 
the universe, and are thus well known [28,29]. Laplacian eigen- 
functions of manifolds produced by scattering can be deduced 
from such models, and used to extract structural and con- 
formational information from scattering data [17,18]. Our 
algorithms are noise-robust, computationally efficient, work 
with existing and emerging large datasets comprising up to 
20 million snapshots, and can be incorporated into existing 
structure recovery platforms for enhanced reach and impact. 
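The polynomial analogy used above can be made concrete with a minimal sketch. It illustrates only the analogy (a fit of known type, determined by the entire dataset, is far less noisy than any single measurement), not the manifold algorithms themselves. The quadratic "model of known type", the unit noise level and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a quadratic, standing in for a "manifold of known type".
x = np.linspace(-1.0, 1.0, 10_000)
true_curve = 0.5 * x**2 - 0.3 * x + 0.1

# Each individual measurement is dominated by noise (signal range < noise s.d.).
noise_sd = 1.0
samples = true_curve + rng.normal(0.0, noise_sd, size=x.size)

# Fit a polynomial of the known type (degree 2) to the entire dataset.
coeffs = np.polyfit(x, samples, deg=2)
fitted_curve = np.polyval(coeffs, x)

# Error of the raw measurements versus error of the fitted curve.
rms_raw = np.sqrt(np.mean((samples - true_curve) ** 2))
rms_fit = np.sqrt(np.mean((fitted_curve - true_curve) ** 2))

print(f"r.m.s. error of raw samples : {rms_raw:.3f}")
print(f"r.m.s. error of fitted curve: {rms_fit:.3f}")
```

Because only three coefficients are estimated from 10 000 samples, the fitted curve is much closer to the truth than the individual samples are.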

Our approach can be simply understood by considering, 
for example, a particle with three orientational degrees of 
freedom. As the particle orientation is changed, the changes 
in the pixel intensities are a function of only three parameters. 
This imposes a strong correlation among the pixel intensities, 
which can be used to determine the snapshot orientations, and 
thus determine the three-dimensional structure [16,18,30,31]. 
Specifically, a snapshot consisting of p pixels can be represented 
as a p-dimensional vector, with each component representing
the intensity value at a pixel. The fact that the intensities are a
function of only three parameters means that the p-dimensional
vector tips all lie on a three-dimensional hypersurface
('manifold') in the p-dimensional space of intensities
(figure 1). This manifold is an expression of the correlated 
way in which the pixel intensities change with the particle 
orientation. In fact, each point on the manifold represents a 
snapshot from a specific object orientation. 
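The correlation argument can be illustrated with a toy sketch. For simplicity it assumes a single rotation angle rather than three orientational degrees of freedom, so the snapshot vectors trace out a closed curve rather than a three-dimensional manifold. The object (three Gaussian blobs), the image size and the function name `render_snapshot` are hypothetical choices made for this illustration.

```python
import numpy as np

def render_snapshot(theta, size=32):
    """Render a toy 2D object (three Gaussian blobs) rotated by angle theta."""
    blobs = np.array([[0.5, 0.0], [-0.2, 0.4], [0.1, -0.6]])  # hypothetical object
    c, s = np.cos(theta), np.sin(theta)
    rotated = blobs @ np.array([[c, -s], [s, c]]).T
    axis = np.linspace(-1.0, 1.0, size)
    xx, yy = np.meshgrid(axis, axis)
    image = np.zeros((size, size))
    for bx, by in rotated:
        image += np.exp(-((xx - bx) ** 2 + (yy - by) ** 2) / (2 * 0.1**2))
    return image.ravel()  # p = size * size pixel intensities as one vector

n = 180
angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
snapshots = np.stack([render_snapshot(t) for t in angles])  # shape (n, p)

# Pairwise squared Euclidean distances in the p-dimensional space of intensities.
sq = np.sum(snapshots**2, axis=1)
dist2 = sq[:, None] + sq[None, :] - 2.0 * snapshots @ snapshots.T
np.fill_diagonal(dist2, np.inf)
nearest = np.argmin(dist2, axis=1)

# Cyclic separation, in angular steps, between each snapshot and its nearest neighbour.
step = np.abs(nearest - np.arange(n))
step = np.minimum(step, n - step)
print("snapshot vector length p:", snapshots.shape[1])
print("largest angular step to nearest neighbour:", step.max())
```

Although each snapshot lives in a 1024-dimensional space, neighbours in that space are neighbours in orientation, which is what allows the orientation to be read off the manifold.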

Discovering the manifold in a noisy dataset constitutes the 
first step in the analysis. Starting with the Euclidean distance 
between vectors representing snapshots in the p-dimensional
data space, a number of powerful graph-theoretic techniques 
can be used to discover low-dimensional manifolds underlying 
the high-dimensional data [19,23-26,32-34]. Each algorithm 
has its strengths and limitations, with the most common 




Figure 1. Manifold as expression of correlation. An object has only three 
orientational degrees of freedom. This means that the p pixel intensities 
in a snapshot change in a correlated fashion with object orientation. This 
correlation is described by a three-dimensional manifold in the p-dimensional 
space of pixel intensities. (Online version in colour.) 

problem being noise sensitivity [35]. Our approach incorpor- 
ates three different manifold embedding approaches based 
on generative topographic mapping [19,20], Isomap [26] and 
Diffusion map [25,32,36]. In each case, extensive effort was 
required to achieve noise-robustness. The resulting algorithms 
can operate at signal levels as low as −20 dB (1/100 on a linear
scale), depending on the application [18]. 
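For orientation, a textbook diffusion map can be sketched in a few lines. This is the basic algorithm only, without the modifications for noise-robustness described above, and it is not the implementation used in this work. The toy data (a circle embedded in a 50-dimensional space with mild noise), the kernel width `eps` and the function name `diffusion_map` are assumptions of the sketch.

```python
import numpy as np

def diffusion_map(data, eps, n_coords=2):
    """Basic diffusion map: return the leading non-trivial eigenvectors."""
    sq = np.sum(data**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * data @ data.T, 0.0)
    kernel = np.exp(-dist2 / eps)                      # Gaussian kernel
    degree = kernel.sum(axis=1)
    # Symmetric normalisation; same eigenvalues as the Markov matrix D^-1 K.
    sym = kernel / np.sqrt(np.outer(degree, degree))
    vals, vecs = np.linalg.eigh(sym)                   # ascending eigenvalues
    order = np.argsort(vals)[::-1]
    vecs = vecs[:, order] / np.sqrt(degree)[:, None]   # right eigenvectors of D^-1 K
    return vecs[:, 1 : n_coords + 1]                   # skip the trivial constant one

rng = np.random.default_rng(1)
n, p = 400, 50
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
basis, _ = np.linalg.qr(rng.normal(size=(p, 2)))       # random orthonormal 2D subspace
data = circle @ basis.T + rng.normal(0.0, 0.05, size=(n, p))

coords = diffusion_map(data, eps=1.0)
# Up to an arbitrary rotation and reflection, this angle tracks the hidden parameter.
recovered = np.arctan2(coords[:, 1], coords[:, 0])
```

In this toy case the two leading non-trivial eigenvectors span the cosine and sine of the hidden angle, so the one-parameter manifold is recovered without being told its parametrisation.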

Once the data manifold has been determined and 
embedded in a suitable space, one must discover how to inter- 
pret the outcome. Specifically, one needs to identify the 
operations connecting any two points on the manifold. For 
example, in order to reconstruct a three-dimensional image 
of a given conformation, one must identify all points on the 
manifold which can be reached by SO(3) operations (three- 
dimensional rotations) alone. Similarly, to map conformations, 
all points on the manifold connected by conformational oper- 
ations alone must be identified. This constitutes the second 
important step in our approach. 

Manifolds are best described in differential geometrical 
terms, with the metric — the local measure of distance — playing 
an important role. Using a differential geometric formulation of 
scattering, we have been able to relate changes in the data mani- 
fold to specific operations [18]. In non-technical terms, one would 
like to relate infinitesimal changes in the intensity distribution in 
a snapshot to the corresponding infinitesimal operations affecting
the orientation and/or conformation of the object. In other
words, one would like to relate the metric of the data manifold
to the metric of the manifold of operations. This would allow
one to determine the rotation and/or conformation operations
connecting any pair of snapshots. Achieving this is tantamount 
to having a model of the object, in the sense that, given any snap- 
shot, any other corresponding to a desired object orientation and 
conformation can be produced on demand. The problem, how- 
ever, is that the metric of data manifolds produced by 
scattering is not simply related to that of the manifold of oper- 
ations. For cases involving orientational changes only, we have 
solved this problem in two steps. First, we have shown that the 
metric of data manifolds produced by scattering onto a two- 
dimensional detector can be decomposed into two parts, one 
with high symmetry, plus an object-specific 'residual' with low 
symmetry [17]. Second, using results from general relativity 
and quantum mechanics, we have shown that the (Laplace-Beltrami)
eigenfunctions of the high-symmetry part are directly
related to those of the manifold of rotation operations under a 
wide range of scattering conditions [17]. This allows one to 
deduce the orientation corresponding to each snapshot. 
Figure 2 demonstrates three-dimensional structure recovery by 
this approach down to very low signal levels [18]. 



3. Conformations from diffraction snapshots 

We have previously shown that experimental XFEL diffrac- 
tion snapshots stemming from an unknown mixture of 
species can be sorted with high accuracy [15]. This potentially 
offers a post-processing route to mitigating the solution puri- 
fication problem. Here, we are concerned with determining 
conformations of a single macromolecule or macromolecular 
assembly. Experimental single-particle XFEL snapshots are 
currently dominated by extraneous effects, such as stochastic 
variations in the beam intensity and inclination, the diameter 
and position of the liquid jet containing the particles, and 
detector saturation and nonlinearity. These overwhelm the 
signal from the particle itself, and cannot be alleviated, for 
example, by increasing the incident beam intensity. Experi- 
ence in cryo-EM, however, has shown that such effects can 
be alleviated by advances in experimental and algorithmic 
techniques. Until improved datasets are generally available, 
algorithm development must rely on simulated snapshots. 

(a) Discrete conformations 

We have previously shown that when simulated single-particle 
XFEL snapshots emanate from different discrete conformations 
of the same complex, our approach automatically sorts the 
diffraction snapshots into separate conformational classes 
and determines their orientations [14]. Figure 3 shows the 
results of a manifold embedding analysis, when a mixture of 
randomly oriented diffraction snapshots from the closed and 
open conformations of the enzyme adenylate kinase (ADK, 
PDB identifiers: 1ank and 4ake, respectively) were presented
to the algorithm at the signal level corresponding to 0.04 pho- 
tons per Shannon pixel at 0.18 nm with shot noise [14]. Because 
of their chemical identity, the conformations of ADK are 
difficult to separate. As shown in figure 3, our algorithm auto- 
matically sorts the snapshots into different manifolds, and 
determines the orientations of the members of each set of snap- 
shots. No prior information was provided to the algorithm 
regarding the type or number of conformations present. 

The confidence with which sorting was performed can 
be deduced as follows. Noise causes the vectors representing 
the snapshots not to lie exactly on the manifolds, thus imparting
a certain 'thickness' to each manifold. This can be
quantified in terms of the widths (standard deviations) of
the distributions of vectors about the manifolds. At the
signal level of 0.04 photons per pixel with shot noise, the smallest
separation between the two manifolds is approximately
10 s.d. This means that diffraction snapshots from the differ- 
ent conformations are sorted with extreme fidelity, even in 
the presence of substantial noise. This level of confidence 
clearly cannot be expected with experimental data. Neverthe- 
less, results obtained with simulated data provide an 
indication of the efficiency with which different confor- 
mations of a molecule may be identified and separated. We 
note that larger objects, such as macromolecular complexes, 
produce larger signals, and should thus be more readily 
amenable to our approach. 
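The notion of separation in units of manifold thickness can be sketched with a toy example. Two concentric circles stand in for the two manifolds, and noise displaces points radially; the radii, noise level and function name `noisy_circle` are hypothetical and were chosen so that the separation comes out near 10 s.d. The actual manifolds, and the way their thickness was measured, are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_circle(radius, n, noise_sd):
    """Points scattered about a circle of the given radius (toy manifold)."""
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius + rng.normal(0.0, noise_sd, n)      # noise displaces points radially
    return np.column_stack([r * np.cos(phi), r * np.sin(phi)])

noise_sd = 0.05
a = noisy_circle(1.0, 5000, noise_sd)
b = noisy_circle(1.5, 5000, noise_sd)

# "Thickness": s.d. of the distribution of points about each manifold.
radial_a = np.linalg.norm(a, axis=1)
radial_b = np.linalg.norm(b, axis=1)
thickness = np.sqrt(0.5 * (radial_a.var() + radial_b.var()))

# Separation between the two manifolds in units of that thickness.
separation = abs(radial_b.mean() - radial_a.mean())
ratio = separation / thickness
print(f"separation = {ratio:.1f} s.d.")
```

At a separation of this order, the two distributions of points have essentially no overlap, which is why sorting is nearly error-free in the simulated case.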

(b) Continuous conformations 

Macromolecular complexes are even more likely to possess con- 
formational continua than discrete conformations. However, 
one must walk before learning to run. We have therefore used 
the unfolding of ADK to demonstrate the principle of mapping 
Figure 2. Three-dimensional reconstructions of a chaperonin molecule from cryo-EM snapshots [18]. Left: noise-free simulation. Centre: with experimental images at
20 electrons Å⁻². Right: with images obtained by processing experimental snapshots to approximate a dose of 1.7 electrons Å⁻². Insets show typical snapshots at
each electron dose. (Online version in colour.)
Figure 3. Sorting snapshots from different conformations. A mixture of simulated diffraction snapshots from the molecule ADK in its open and closed conformations at 0.04
photons per pixel at 0.18 nm with shot noise is automatically sorted into different manifolds and the orientation of each snapshot determined [14]. (Online version in colour.) 



conformational continua. The unfolding process of ADK was 
simulated by molecular dynamics as follows. The coordinates 
of ADK from E. coli in the open state (Protein Data Bank identifier:
4ake) were placed in a spherical droplet of water and
simulated at a nominal temperature of 850 K using NAMD 
[37]. Diffraction snapshots (12500) were simulated from 100 
conformations, with each conformation assuming 125 orien- 
tations about one axis. Snapshots were provided to a modified 
version of the Isomap manifold embedding algorithm [26], 
and the resulting manifold displayed through its projections 
along the first three principal components (figure 4). It is clear 
that orientational and conformational variations give rise to a 
tubular manifold. Qualitatively, the closed cross sections of 
the tube represent orientational change, whereas paths termi- 
nating at the tube ends include conformational change. It can 
be readily shown that the manifold is Riemannian. Owing to 
the SO(3) symmetry operations involving molecular orientation, 
the manifold has SO(3) symmetry in some directions. Such 
manifolds have received considerable attention in general relativity
[28,29]. We have demonstrated that these techniques,
suitably modified, can be used to recover the three-dimensional 
structure of biological objects with computational complexities 
10⁴ times higher than previously possible [17]. Following general
relativistic models for the evolution of the universe [29],




Figure 4. Manifold traced out by 100 conformations of an unfolding ADK 
molecule, each able to assume one of 125 orientations about a single 
axis. The molecular conformations corresponding to different points on the 
manifold are shown in the insets. (Online version in colour.) 



we are exploring extensions of this approach to mapping con- 
formational continua (A. Ourmazd et al. 2013, unpublished 
data). Figure 5, for example, shows projection on the first diffu- 
sion map eigenfunction of the manifold produced by snapshots 
of melting ADK free to assume any orientation in three 
Figure 5. Eigenfunction revealing conformational information. The manifold was formed by diffraction snapshots of 12 conformations of melting ADK free to assume
any orientation in three dimensions. Projections on the first eigenfunction are shown as (a) a histogram, and (b) as a function of time. (Online version in colour.)



dimensions. From the molecular dynamics simulation 
described above, 2.4 million diffraction snapshots were simu- 
lated from 12 conformations separated by equal time steps of 
37.5 ps, with each conformation assuming 200 000 orientations 
in three-dimensional space. The snapshots were then analysed
with the diffusion map algorithm. As depicted in figure 5a, 
the histogram clearly distinguishes all 12 conformations 
except the last two, which overlap. Thus, the first eigenvector 
directly provides conformational information, confirming 
the possibility to identify directions of orientational and con- 
formational change on the manifold. These results offer a 
potentially promising route to using heterogeneous XFEL 
datasets to map conformational continua in macromolecular 
complexes. The possibility to use single-particle techniques to 
map conformational continua is new, and expected to remove 
a major bottleneck in the study of complexes, where structural 
flexibility can play a prominent role. 
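The tubular geometry described for the single-axis case can be mimicked with a toy construction. One angle (orientation about a single axis) and one parameter (conformation) define a tube, which is placed in a higher-dimensional space and projected onto its first three principal components. Plain principal component analysis is used here as a stand-in and is not Isomap; the tube shape, the embedding dimension and all names are assumptions of the sketch, and only the counts (100 conformations, 125 orientations) are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n_conf, n_orient, p = 100, 125, 60

# Hypothetical parametrisation: phi = orientation about one axis, t = conformation.
phi = np.linspace(0.0, 2.0 * np.pi, n_orient, endpoint=False)
t = np.linspace(0.0, 1.0, n_conf)
tt, pp = np.meshgrid(t, phi, indexing="ij")            # shape (n_conf, n_orient)

# A tube whose radius varies with conformation, placed in a p-dimensional space.
radius = 1.0 + 0.5 * tt
tube = np.stack([radius * np.cos(pp), radius * np.sin(pp), 3.0 * tt], axis=-1)
embed, _ = np.linalg.qr(rng.normal(size=(p, 3)))       # random orthonormal embedding
snapshots = tube.reshape(-1, 3) @ embed.T              # one row per snapshot

# Project onto the first three principal components.
centred = snapshots - snapshots.mean(axis=0)
_, sing, vt = np.linalg.svd(centred, full_matrices=False)
proj = (centred @ vt[:3].T).reshape(n_conf, n_orient, 3)

# Closed cross-section: first and last orientation of one conformation are neighbours.
loop_gap = np.linalg.norm(proj[50, 0] - proj[50, -1])
# Open ends: the first and last conformations are far apart.
end_gap = np.linalg.norm(proj[0, 0] - proj[-1, 0])
print(snapshots.shape[0], "snapshots;", f"loop gap {loop_gap:.3f}, end gap {end_gap:.3f}")
```

Moving around a cross-section returns to the starting point, whereas moving along the tube does not, which is the qualitative distinction between orientational and conformational change drawn above.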

4. Conformations from cryo-electron microscopy 
images 

Conformational changes are, as a rule, far more subtle than 
those displayed by an unfolding macromolecule. In such 
cases, the effect of even small orientational changes can over- 
whelm the signal owing to typical conformational variations. 
Under these circumstances, a different approach is needed to 
map conformations. 

We now outline a manifold-based approach capable of 
sorting with high fidelity, simulated noisy single-particle 
cryo-EM snapshots of mildly heterogeneous particles, and 
demonstrate this capability in the context of ribosome com- 
plexes with and without growth elongation factor (EFG; 
figure 6). In order to facilitate comparison with the results 
obtained by other approaches, we use a dataset often used 



for benchmarking the performance of conformational sorting 
algorithms [38]. A mixture of 200 000 snapshots from ribosome
able to assume any orientation in three dimensions,
with and without EFG, was simulated at defocus values
ranging from −1.5 to −2.5 µm with added background and
shot noise corresponding to an SNR of −12 dB (0.06 on a
linear scale). These parameters are typical of experimental 
cryo-EM snapshots. 
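The decibel figures quoted in this paper follow the power-ratio convention, as the stated pairs (−20 dB and 1/100, −12 dB and 0.06) indicate. A minimal helper for the conversion is sketched here; the function names are illustrative.

```python
import math

def db_to_linear(db):
    """Convert a power ratio in decibels to a linear ratio."""
    return 10.0 ** (db / 10.0)

def linear_to_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)

for level in (-20.0, -12.0):
    print(f"{level:.0f} dB corresponds to a linear SNR of {db_to_linear(level):.3f}")
```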

The analysis proceeds as follows. First, the snapshot 
orientations are determined, irrespective of the (unknown) 
conformational states of the particles. This is possible, 
because the effect of orientational change dominates. Any 
algorithm capable of determining orientation can be used 
for this purpose. Both standard cryo-EM [8] and manifold- 
based approaches are able to determine orientation with 
an accuracy of about one Shannon angle. Conformational 
discrimination is achieved by means of a special kernel for 
the diffusion map algorithm to extract the small confor- 
mational signal in the presence of large changes owing to 
orientation, viz. 

\[
W_{ij} =
\begin{cases}
\exp\!\left(-\dfrac{D_{ij}^{2}}{\sigma}\right), & \Delta\theta_{ij} < \theta_{\epsilon}, \\
0, & \text{otherwise},
\end{cases}
\]

where $W_{ij}$ is the weighting factor for a pair of snapshots i and j
separated by a great-circle angular distance $\Delta\theta_{ij}$ and a modified
Euclidean distance $D_{ij}$ (to be defined below), $\theta_{\epsilon}$ an upper
bound for $\Delta\theta_{ij}$ and $\sigma$ the Gaussian kernel width. At a sufficiently
small $\theta_{\epsilon}$, the conformational signal dominates. By
assigning zero weight to snapshot pairs separated by more
than $\theta_{\epsilon}$, this weighting scheme is primarily sensitive to
changes in conformation only. To enhance this further, we 
retain only a small number of the shortest distances in the 
diffusion map analysis. 
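The weighting scheme can be sketched as follows: a Gaussian weight in the snapshot distance, set to zero for pairs whose orientations differ by the angular cutoff or more, with only a few nearest neighbours retained per snapshot. The exact normalisation of the Gaussian exponent, the exclusion of self-pairs and the symmetrisation of the neighbour graph are assumptions of this sketch, as are the function name `conformational_weights` and the toy inputs.

```python
import numpy as np

def conformational_weights(dist2, ang_dist, theta_eps, sigma, n_neighbours):
    """Gaussian weights in D_ij, zero at or beyond the angular cutoff theta_eps.

    dist2    : (n, n) squared distances between snapshots
    ang_dist : (n, n) angular distances between snapshot orientations
    The exp(-D^2 / sigma) form of the kernel is an assumption of this sketch.
    """
    n = dist2.shape[0]
    admissible = (ang_dist < theta_eps) & ~np.eye(n, dtype=bool)
    masked = np.where(admissible, dist2, np.inf)

    # Retain only the n_neighbours shortest admissible distances per snapshot.
    nearest = np.argsort(masked, axis=1)[:, :n_neighbours]
    rows = np.arange(n)[:, None]
    keep = np.zeros((n, n), dtype=bool)
    keep[rows, nearest] = np.isfinite(masked[rows, nearest])
    keep |= keep.T                                   # keep the graph symmetric
    return np.where(keep, np.exp(-dist2 / sigma), 0.0)

# Toy usage: single-axis orientations and random stand-in snapshot vectors.
rng = np.random.default_rng(4)
n = 200
angles = rng.uniform(0.0, 2.0 * np.pi, n)
diff = np.abs(angles[:, None] - angles[None, :])
ang_dist = np.minimum(diff, 2.0 * np.pi - diff)
features = rng.normal(size=(n, 16))
dist2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)

W = conformational_weights(dist2, ang_dist, theta_eps=0.08,
                           sigma=dist2.mean(), n_neighbours=3)
print("non-zero weights:", int((W > 0).sum()))
```

Restricting the graph to angular neighbours means that the remaining variation in distance is dominated by differences other than orientation, which is the point of the cutoff.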



Figure 6. Variance-normalized, simulated noisy cryo-EM images of ribosome with and without EFG in different orientations, in random order. (Online version in colour.) 



Because cryo-EM snapshots can also differ by the defocus 
at which they were obtained, the effect of the defocus on 
two otherwise identical snapshots must be eliminated. 
This is achieved by the following definition of the Euclidean
distance $D_{ij}$:

\[
\begin{aligned}
D_{ij}^{2} &= \sum_{v} \left| \mathrm{PSF}_{j} \otimes J_{i} - \mathrm{PSF}_{i} \otimes J_{j} \right|^{2} \\
&= \sum_{v} \left| \mathrm{PSF}_{j} \otimes (\mathrm{PSF}_{i} \otimes P_{i}) - \mathrm{PSF}_{i} \otimes (\mathrm{PSF}_{j} \otimes P_{j}) \right|^{2} \\
&= \sum_{v} \left| \mathrm{PSF}_{i} \otimes \mathrm{PSF}_{j} \otimes (P_{i} - P_{j}) \right|^{2}.
\end{aligned}
\]

For each snapshot i, $J_{i}$ represents the image intensity distribution,
$\mathrm{PSF}_{i}$ is the microscope point-spread function, $P_{i}$
is the projected potential and $\otimes$ is the convolution operator.
This 'double-filtering' scheme ensures a zero Euclidean distance
between two snapshots stemming from the same
projected potential, but differing in defocus values. For computational
efficiency, the distances are calculated in Fourier
space, so that convolution becomes multiplication. With $\tilde{J}_{i}$
the Fourier transform of the image and $\mathrm{CTF}_{i}$ the Fourier
transform of the point-spread function, application of
Parseval's theorem [39] yields

\[
D_{ij}^{2} = \sum \left| \mathrm{CTF}_{j}\,\tilde{J}_{i} - \mathrm{CTF}_{i}\,\tilde{J}_{j} \right|^{2}.
\]
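The double-filtering distance can be checked numerically with a short sketch. Each image is multiplied in Fourier space by the transfer function of the other before the difference is taken, so that two images of the same projected potential at different defocus give a distance of zero to rounding error. The transfer function below is a hypothetical smooth filter, not a physical CTF model, and the function names and toy potentials are assumptions of the sketch.

```python
import numpy as np

def ctf_filter(shape, defocus):
    """Hypothetical real, even transfer function (not a physical CTF model)."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.sin(defocus * (fx**2 + fy**2) + 0.1)

def image_from_potential(potential, ctf):
    """Image = PSF convolved with potential, computed as a product in Fourier space."""
    return np.real(np.fft.ifft2(ctf * np.fft.fft2(potential)))

def double_filtered_dist2(image_i, ctf_i, image_j, ctf_j):
    """Sum over Fourier pixels of |CTF_j F(J_i) - CTF_i F(J_j)|^2.

    Dividing by the pixel count matches the real-space sum (Parseval) under
    NumPy's unnormalised forward FFT convention.
    """
    fi, fj = np.fft.fft2(image_i), np.fft.fft2(image_j)
    return np.sum(np.abs(ctf_j * fi - ctf_i * fj) ** 2) / fi.size

rng = np.random.default_rng(5)
shape = (64, 64)
p_a = rng.normal(size=shape)                 # stand-in projected potential
p_b = p_a.copy()
p_b[20:28, 20:28] += 1.0                     # small "conformational" change
ctf_1, ctf_2 = ctf_filter(shape, 150.0), ctf_filter(shape, 250.0)

img_a1 = image_from_potential(p_a, ctf_1)
img_a2 = image_from_potential(p_a, ctf_2)
img_b2 = image_from_potential(p_b, ctf_2)

same = double_filtered_dist2(img_a1, ctf_1, img_a2, ctf_2)   # same potential
diff = double_filtered_dist2(img_a1, ctf_1, img_b2, ctf_2)   # different potential
naive = np.sum((img_a1 - img_a2) ** 2)                       # unfiltered distance
print(f"same potential: {same:.2e}, different: {diff:.2e}, naive: {naive:.2e}")
```

The unfiltered distance between the two images of the same potential is large, whereas the double-filtered distance vanishes, leaving only the contribution from the difference in potential.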

Figure 7 shows the results obtained by appropriate 
embedding of ribosome with and without EFG, as outlined 
above, with the following parameters: $\theta_{\epsilon} = 0.08$, corresponding
to two Shannon angles, the number of nearest neighbours
retained = 3, and $\sigma$ determined as described in [40]. In this
plot, each snapshot is represented by its coordinate in the 
plane defined by the second and third diffusion map eigen- 
vectors, with the snapshots coloured red and blue 
corresponding to ribosome with and without EFG, respect- 
ively. The cutting line (black) separates the two clusters 
with a sorting fidelity of 99.96%. 
Figure 7. Two ribosome conformations (with and without EFG) are separated
with 99.96% accuracy. (Online version in colour.) 

Table 1 shows a compilation of the results at constant and 
varying defocus, and different means of obtaining orientational 
information, in comparison with benchmarks from the 
literature [38,41,42]. 

In summary, these results show that the two conforma- 
tions can be identified with 99.96% accuracy in the presence 
of experimental noise and defocus variations, compared 
with the previously best published fidelity of 87% [38]. 
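Sorting fidelity with respect to a cutting line can be computed as the fraction of snapshots that fall on the correct side of that line in the embedding plane. The sketch below uses two synthetic Gaussian clusters in place of the diffusion map coordinates; the cluster positions, the line and the function name `sorting_fidelity` are hypothetical, so the printed value is unrelated to the fidelities reported here.

```python
import numpy as np

def sorting_fidelity(coords, labels, normal, offset):
    """Fraction of points on the correct side of the line normal . x = offset."""
    side = coords @ normal > offset          # True on one side of the line
    agree = np.mean(side == labels)
    return max(agree, 1.0 - agree)           # class naming is arbitrary

rng = np.random.default_rng(6)
n = 100_000
cluster_1 = rng.normal([+3.0, 0.0], 1.0, size=(n, 2))   # toy cluster, first class
cluster_2 = rng.normal([-3.0, 0.0], 1.0, size=(n, 2))   # toy cluster, second class
coords = np.vstack([cluster_1, cluster_2])
labels = np.concatenate([np.ones(n, dtype=bool), np.zeros(n, dtype=bool)])

fidelity = sorting_fidelity(coords, labels, normal=np.array([1.0, 0.0]), offset=0.0)
print(f"sorting fidelity: {100.0 * fidelity:.2f}%")
```

For scale, if all 200 000 snapshots entered the count, a fidelity of 99.96% would correspond to about 80 misassigned snapshots.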

5. Discussion 

The techniques most commonly used to investigate confor- 
mations of macromolecules implicitly assume the presence 
of discrete conformations, sometimes requiring starting 'tem- 
plates' for each conformation, or at least some knowledge of 
the number of conformations present. The investigation 
of conformational continua, perhaps one of the most impor- 
tant aspects of molecular function, has remained difficult, if 
not beyond reach. The approach we have outlined offers a 



Table 1. Compilation of the results.

defocus (negative, µm)   orientational accuracy (Shannon angles)   sorting fidelity (%)   benchmarks (%)   remarks
2                        ~1                                        100                                     known orientations
2                        2                                         99.9995                99.7             diff. map orientations
1.5-2.5                  ~1                                        99.97                                   known orientations
1.5-2.5                  ~1                                        99.96                  87               SPIDER orientations

natural means of dealing with conformations, whether 
discrete or continuous, without bias or a priori knowledge. 
Both diffraction and image snapshots are amenable to this 
approach, even in the presence of overwhelming noise and 
(known) systematic variations such as defocus. 

The majority of results presented here, however, pertain 
to simulated snapshots, and successful experimental demon- 
stration remains an important future task. In the case of XFEL 
snapshots, this must include means of dealing with the 
effects of unknown stochastic variations in the intensity, 
position and inclination of the incident beam, and the geo- 
metrical parameters characterizing the way the particle was 
illuminated in each shot. 



6. Summary and conclusion 



We have described an approach naturally suited to investigating
macromolecular conformations and conformational
continua using heterogeneous sets of diffraction or image
snapshots, without the need for prior assumptions regarding 
the nature of the conformational variety present. Simulations 
show the approach to be capable of operating with extreme 
fidelity at signal-to-noise levels typical of experimental data, 
at least in the case of cryo-EM snapshots. This offers a prom- 
ising route to investigating conformational variety in 
macromolecular systems and its role in biological function. 